Outlier Dimensions that Disrupt Transformers are Driven by Frequency
While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequencies of encoded tokens in pre-training data, and that these dimensions also contribute to the "vertical" self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that better take into account the skewed token distributions.
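The ablation at the heart of this result, zeroing a handful of outlier dimensions across all token representations, can be sketched as follows. This is an illustrative toy, not the paper's code: the function name and the choice of dimension index are hypothetical, and the actual outlier dimensions are identified empirically per model.

```python
def disable_outlier_dims(hidden_states, outlier_dims):
    """Zero out selected dimensions of every token's hidden state.

    hidden_states: list of per-token vectors (each a list of floats).
    outlier_dims: indices of the dimensions to disable; real outlier
    indices are model-specific (hypothetical here).
    """
    dims = set(outlier_dims)
    return [
        [0.0 if i in dims else x for i, x in enumerate(vec)]
        for vec in hidden_states
    ]

# Toy 3-token, 4-dimensional example where dimension 2 carries an
# outsized magnitude, mimicking an outlier dimension.
h = [[0.1, -0.2, 57.0, 0.3],
     [0.0, 0.4, -61.2, -0.1],
     [0.2, 0.1, 49.8, 0.0]]

ablated = disable_outlier_dims(h, [2])
```

After the call, every token's coordinate along dimension 2 is zero while all other coordinates are untouched; in a real model this tiny intervention is what produces the large accuracy drop described above.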
Myths and Legends in High-Performance Computing
In this thought-provoking article, we discuss certain myths and legends that
are folklore among members of the high-performance computing community. We
gathered these myths from conversations at conferences and meetings, product
advertisements, papers, and other communications such as tweets, blogs, and
news articles within and beyond our community. We believe they represent the
zeitgeist of the current era of massive change, driven by the end of many
scaling laws such as Dennard scaling and Moore's law. While some laws end, new
directions are emerging, such as algorithmic scaling or novel architecture
research. Nevertheless, these myths are rarely based on scientific facts, but
rather on some evidence or argumentation. In fact, we believe that this is the
very reason for the existence of many myths and why they cannot be answered
clearly. While it feels like there should be clear answers for each, some may
remain endless philosophical debates, such as whether Beethoven was better than
Mozart. We would like to see our collection of myths as a discussion of
possible new directions for research and industry investment.
At the Locus of Performance: A Case Study in Enhancing CPUs with Copious 3D-Stacked Cache
Over the last three decades, innovations in the memory subsystem were
primarily targeted at overcoming the data movement bottleneck. In this paper,
we focus on a specific market trend in memory technology: 3D-stacked memory and
caches. We investigate the impact of extending the on-chip memory capabilities
in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we
propose a method oblivious to the memory subsystem to gauge the upper-bound in
performance improvements when data movement costs are eliminated. Then, using
the gem5 simulator, we model two variants of LARC, a processor fabricated in
1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of
experiments involving a broad set of proxy-applications and benchmarks, we aim
to reveal where HPC CPU performance could be circa 2028, and conclude with an
average boost of 9.77x for cache-sensitive HPC applications, on a per-chip
basis. Additionally, we exhaustively document our methodological exploration to
motivate HPC centers to drive their own technological agenda through enhanced
co-design.
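The idea of a memory-subsystem-oblivious upper bound can be illustrated with a simple Amdahl-style estimate: if data movement accounted for a fraction of runtime and were eliminated entirely, the remaining compute time caps the achievable speedup. This is a sketch of the general reasoning only, not the paper's methodology; the function name and the timing figures are illustrative.

```python
def upper_bound_speedup(total_time, data_movement_time):
    """Amdahl-style cap on speedup if all data-movement cost vanished.

    total_time: measured runtime of the application.
    data_movement_time: portion of that runtime spent moving data
    (both values are hypothetical inputs for illustration).
    """
    if not 0.0 <= data_movement_time < total_time:
        raise ValueError("data movement must be within total runtime")
    # Only the non-memory fraction of the runtime remains.
    return total_time / (total_time - data_movement_time)

# If 60% of a 10 s run is data movement, removing it caps speedup at 2.5x.
bound = upper_bound_speedup(10.0, 6.0)  # → 2.5
```

In this framing, gauging the upper bound requires only an estimate of the data-movement share of runtime, independent of any particular cache or memory configuration.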